chemical data
ChemLLM: A Chemical Large Language Model
Zhang, Di, Liu, Wei, Tan, Qian, Chen, Jingdan, Yan, Hang, Yan, Yuliang, Li, Jiatong, Huang, Weiran, Yue, Xiangyu, Zhou, Dongzhan, Zhang, Shufei, Su, Mao, Zhong, Hansen, Li, Yuqiang, Ouyang, Wanli
Large language models (LLMs) have made impressive progress in chemistry applications, including molecular property prediction, molecular generation, experimental protocol design, etc. However, the community lacks a dialogue-based model specifically designed for chemistry. The challenge arises from the fact that most chemical data and scientific knowledge are primarily stored in structured databases, and the direct use of these structured data compromises the model's ability to maintain coherent dialogue. To tackle this issue, we develop a novel template-based instruction construction method that transforms structured knowledge into plain dialogue, making it suitable for language model training. By leveraging this approach, we develop ChemLLM, the first large language model dedicated to chemistry, capable of performing various tasks across chemical disciplines with smooth dialogue interaction. ChemLLM beats GPT-3.5 on all three principal tasks in chemistry, i.e., name conversion, molecular caption, and reaction prediction, and surpasses GPT-4 on two of them. Remarkably, ChemLLM also shows exceptional adaptability to related mathematical and physical tasks despite being trained mainly on chemical-centric corpora. Furthermore, ChemLLM demonstrates proficiency in specialized NLP tasks within chemistry, such as literature translation and cheminformatic programming. ChemLLM opens up a new avenue for exploration within chemical studies, while our method of integrating structured chemical knowledge into dialogue systems sets a new frontier for developing LLMs across various scientific fields. Codes, Datasets, and Model weights are publicly accessible at hf.co/AI4Chem/ChemLLM-7B-Chat.
- Asia > China (0.46)
- North America > United States (0.28)
- Law (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Education (1.00)
- (6 more...)
nach0: Multimodal Natural and Chemical Languages Foundation Model
Livne, Micha, Miftahutdinov, Zulfat, Tutubalina, Elena, Kuznetsov, Maksim, Polykovskiy, Daniil, Brundyn, Annika, Jhunjhunwala, Aastha, Costa, Anthony, Aliper, Alex, Zhavoronkov, Alex
Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- (6 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.68)
Look out for potential bias in chemical data sets
There might be disadvantages to using tried and trusted methods.Credit: Science Photo Library Like most research fields, materials science has embraced'big data', including machine-learning models and techniques. These are being used to predict new materials and properties, and devise routes to existing drugs and chemicals. But machine learning requires training data, such as those on reagents, conditions and starting materials. These are usually gleaned from the literature, and are human-generated. The choice of reagents that researchers use could come, for example, from experience or from previously published work. It might be based on a recommendation passed from supervisor to graduate student, or simply on how easy reagents are to find or buy.
- North America > United States > Pennsylvania (0.06)
- North America > United States > New York (0.06)
New AI approach bridges the 'slim-data gap' that can stymie deep learning approaches
Scientists have developed a deep neural network that sidesteps a problem that has bedeviled efforts to apply artificial intelligence to tackle complex chemistry--a shortage of precisely labeled chemical data. The new method gives scientists an additional tool to apply deep learning to explore drug discovery, new materials for manufacturing, and a swath of other applications. Predicting chemical properties and reactions among millions upon millions of compounds is one of the most daunting tasks that scientists face. There is no source of complete information from which a deep learning program could draw upon. Usually, such a shortage of a vast amount of clean data is a show-stopper for a deep learning project.
AI more accurate than animal testing for spotting toxic chemicals
Most consumers would be dismayed with how little we know about the majority of chemicals. Only 3 percent of industrial chemicals – mostly drugs and pesticides – are comprehensively tested. Most of the 80,000 to 140,000 chemicals in consumer products have not been tested at all or just examined superficially to see what harm they may do locally, at the site of contact and at extremely high doses. I am a physician and former head of the European Center for the Validation of Alternative Methods of the European Commission (2002-2008), and I am dedicated to finding faster, cheaper and more accurate methods of testing the safety of chemicals. To that end, I now lead a new program at Johns Hopkins University to revamp the safety sciences.
- Europe (0.39)
- North America > United States (0.31)
- Materials > Chemicals (1.00)
- Government > Regional Government > North America Government > United States Government (0.31)
Artificial intelligence helps with skin cancer detection
The technology has been devised at the University of Waterloo, together with a team from the Sunnybrook Research Institute. The focus is with the detection of detect melanoma skin cancer. The technology utilizes machine-learning software in order to analyze images of skin lesions. The analysis seeks to provide doctors with objective data on biological markers of melanoma. This is important since early detection of skin cancer has a high success in terms of starting treatment early, whereas late detection is far more serious.
- Health & Medicine > Therapeutic Area > Oncology > Skin Cancer (1.00)
- Health & Medicine > Therapeutic Area > Dermatology (1.00)